Extracting Multilingual Lexicons from Parallel Corpora
Abstract
This paper describes our recent work on the automatic extraction of translation equivalents from parallel corpora. We present three increasingly complex algorithms: a simple iterative baseline and two more elaborate non-iterative versions. The baseline is described mainly for illustrative purposes, while the non-iterative algorithms show how different working hypotheses can be adopted, motivated by the intended application and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual POS preservation; in the third, POS invariance is not a condition for extraction. The algorithms were evaluated on three different corpora and several language pairs.
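The abstract gives no algorithmic detail, so the following is only an illustrative sketch of what an iterative extractor with a POS-preservation filter could look like. It is not the paper's algorithm: the Dice score, the greedy one-to-one selection, the threshold, and all names (`extract_lexicon`, `min_score`, `max_rounds`, `toy_corpus`) are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def extract_lexicon(aligned_pairs, min_score=0.5, max_rounds=10):
    """Greedy iterative extraction of translation equivalents (illustrative).

    aligned_pairs: list of (source_tokens, target_tokens), one entry per
    aligned sentence pair; every token is a (lemma, pos) tuple.
    Returns a dict mapping source tokens to target tokens.
    """
    remaining = [(list(src), list(tgt)) for src, tgt in aligned_pairs]
    lexicon = {}
    for _ in range(max_rounds):
        cooc, src_freq, tgt_freq = Counter(), Counter(), Counter()
        for src, tgt in remaining:
            src_set, tgt_set = set(src), set(tgt)
            src_freq.update(src_set)
            tgt_freq.update(tgt_set)
            for s, t in product(src_set, tgt_set):
                if s[1] == t[1]:  # assumed filter: keep same-POS candidates only
                    cooc[(s, t)] += 1
        # Dice coefficient as an assumed association score.
        scored = sorted(
            ((2 * c / (src_freq[s] + tgt_freq[t]), s, t)
             for (s, t), c in cooc.items()),
            reverse=True,
        )
        accepted, used_targets = {}, set()
        for score, s, t in scored:
            if score < min_score:
                break
            if s in accepted or t in used_targets:
                continue  # each word is paired at most once per round
            accepted[s] = t
            used_targets.add(t)
        if not accepted:
            break  # nothing new was found, so stop iterating
        lexicon.update(accepted)
        # Drop the paired words so later rounds see only unexplained tokens.
        remaining = [
            ([w for w in src if w not in accepted],
             [w for w in tgt if w not in used_targets])
            for src, tgt in remaining
        ]
    return lexicon


# Hypothetical toy English-French corpus, already lemmatized and POS-tagged.
toy_corpus = [
    ([("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
     [("le", "DET"), ("chat", "NOUN"), ("dort", "VERB")]),
    ([("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
     [("le", "DET"), ("chien", "NOUN"), ("dort", "VERB")]),
    ([("the", "DET"), ("cat", "NOUN"), ("eats", "VERB")],
     [("le", "DET"), ("chat", "NOUN"), ("mange", "VERB")]),
    ([("the", "DET"), ("cat", "NOUN"), ("sees", "VERB"),
      ("the", "DET"), ("dog", "NOUN")],
     [("le", "DET"), ("chat", "NOUN"), ("voit", "VERB"),
      ("le", "DET"), ("chien", "NOUN")]),
]

if __name__ == "__main__":
    for source, target in sorted(extract_lexicon(toy_corpus).items()):
        print(source, "->", target)
```

In this sketch the POS filter shrinks the candidate space before scoring, which is one plausible reading of "cross-lingual POS preservation". Dropping the `s[1] == t[1]` check would correspond loosely to the setting in which POS invariance is not required.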
Similar resources
Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora
While parallel corpora are an indispensable resource for data-driven multilingual natural language processing tasks such as machine translation, they are limited in quantity, quality and coverage. As a result, learning translation models from nonparallel corpora has become increasingly important nowadays, especially for low-resource languages. In this work, we propose a joint model for iterativ...
Comparing languages from vocabulary growth to inflection paradigms: A study run on parallel corpora and multilingual lexicons / Comparando lenguas desde el léxico a paradigmas de flexión: un estudio sobre corpus paralelo y léxicos multilingües
In this paper we report on a corpora and lexical comparative study on how to compare the difficulties of five languages (English, German, Spanish, French and Italian) for morphosyntactic analysis and the development of lexicographic resources. Experiments were conducted on two different sets of multilingual parallel corpora and two different morphosyntactic lexicons per language. We measure and...
Automatic processing of multilingual medical terminology: applications to thesaurus enrichment and cross-language information retrieval
OBJECTIVES We present in this article experiments on multi-language information extraction and access in the medical domain. For such applications, multilingual terminology plays a crucial role when working on specialized languages and specific domains. MATERIAL AND METHODS We propose firstly a method for enriching multilingual thesauri which extracts new terms from parallel corpora, and seco...
Using Multilingual Resources for Building SloWNet Faster
This project report presents the results of an approach in which synsets for Slovene wordnet were induced automatically from parallel corpora and already existing wordnets. First, multilingual lexicons were obtained from word-aligned corpora and compared to the wordnets in various languages in order to disambiguate lexicon entries. Then appropriate synset ids were attached to Slovene entries fr...
A Cheap and Fast Way to Build Useful Translation Lexicons
The paper presents a statistical approach to automatic building of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation for the two algorithms is presented in some detail in terms of precision, recall and processing time. We conclude by briefly presenting some of our applications of the mu...
Journal:
- Computers and the Humanities
Volume: 38
Publication date: 2004